Remove duplicates from a Pandas DataFrame

05, July 2019

DataFrame.duplicated()

In Python's Pandas library, the DataFrame class provides a member function, DataFrame.duplicated(), to find duplicate rows based on either all columns or specific columns.

First, create a DataFrame with some duplicate rows:

In [18]:
import pandas as pd
students = [('Vishal', 34, 'Sydney'),
            ('Gaurav', 30, 'Delhi'),
            ('Madhura', 16, 'New York'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Delhi'),
            ('Vishakha', 30, 'Mumbai'),
            ('Madhura', 40, 'London'),
            ('Manavika', 30, 'Delhi')
           ]


# Create a DataFrame object
students_df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
students_df
Out[18]:
       Name  Age      City
0    Vishal   34    Sydney
1    Gaurav   30     Delhi
2   Madhura   16  New York
3  Vishakha   30     Delhi
4  Vishakha   30     Delhi
5  Vishakha   30    Mumbai
6   Madhura   40    London
7  Manavika   30     Delhi

Find Duplicate Rows based on all columns

To find & select the duplicates in all rows based on all columns call the Daraframe.duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence. Now pass this Boolean Series to List[] operator of Dataframe to select the rows which are duplicate i.e.

In [15]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = students_df[students_df.duplicated()]

print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except first occurrence based on all columns are:
       Name  Age   City
4  Vishakha   30  Delhi

Here, all duplicate rows except their first occurrence are returned because the default value of the keep argument is 'first'.

If we want to select all duplicate rows except their last occurrence, we need to pass keep='last', i.e.

In [16]:
# Select duplicate rows except last occurrence based on all columns
duplicateRowsDF = students_df[students_df.duplicated(keep='last')]
 
print("Duplicate Rows except last occurrence based on all columns are :")
print(duplicateRowsDF)
Duplicate Rows except last occurrence based on all columns are:
       Name  Age   City
3  Vishakha   30  Delhi
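
If instead we want to mark every duplicated row, including its first occurrence, we can pass keep=False. This is a minimal sketch (the variable name is just illustrative); on the students_df above it should flag both row 3 and row 4:

# Select all duplicate rows, including first occurrences, based on all columns
allDuplicateRowsDF = students_df[students_df.duplicated(keep=False)]

print("All duplicate rows based on all columns are:")
print(allDuplicateRowsDF)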

Find Duplicate Rows based on selected columns

Compare rows & find duplicates based on the selected columns. For this, pass the list of column names in the subset argument: Dataframe.duplicate() function. It will select & return duplicate rows based on these passed columns only.

Find & select rows based on a single column:

In [17]:
duplicateRowsbyCol = students_df[students_df.duplicated(['Name'])]
 
print("Duplicate Rows based on a single column are:", duplicateRowsbyCol, sep='\n')
Duplicate Rows based on a single column are:
       Name  Age    City
4  Vishakha   30   Delhi
5  Vishakha   30  Mumbai
6   Madhura   40  London
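
Similarly, duplicates can be determined by a combination of columns by passing more than one column name in the subset list. A minimal sketch (variable name illustrative); on students_df this should flag rows 4 and 5, which repeat the Name/Age pair ('Vishakha', 30):

# Select duplicate rows based on the 'Name' and 'Age' columns only
duplicateRowsByCols = students_df[students_df.duplicated(subset=['Name', 'Age'])]

print("Duplicate Rows based on multiple columns are:", duplicateRowsByCols, sep='\n')

Finally, to actually remove the duplicates rather than just select them, we can either negate the Boolean Series with ~ to keep only the non-duplicate rows, or call DataFrame.drop_duplicates(), which accepts the same subset and keep arguments. A minimal sketch:

# Keep only the first occurrence of each row
# (equivalent to students_df[~students_df.duplicated()])
deduped_df = students_df.drop_duplicates()
print(deduped_df)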